Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise (WLE) Dataset).
The goal of the project is to predict the manner in which the exercise was carried out. This is the classe variable in the training set, and any of the other variables may be used to predict it. I will describe how the model was built, including how cross-validation was used, the expected out-of-sample error, and details of the assumptions and decisions made. Finally, the model will be used to predict 20 different test cases.
For this task I will use Decision Trees, Random Forests and Gradient Boosting Machines (GBMs) to fit models to the data provided, after some initial cleaning. These models are chosen as they lend themselves well to classification. I will also carry out a Principal Components Analysis (PCA) to test the impact that reducing the features of the data has on accuracy; while this may not work as well for the tree-based methods chosen, I want to learn how to apply the techniques from the course. For completeness I will also create an ensemble model from the two strongest models fitted, using a stacking approach, to see if a combination of fitted models gives better results: stacking is a common approach across data science competitions.
Note: You can use the floating table of contents at the top left of the document to navigate to the relevant parts. This works in Google Chrome.
You will need several packages installed; please refer to the session information in the appendix to see the packages used at the time of report creation.
In order to keep the write-up of the project concise, a full breakdown of the data cleaning steps is included in the appendix (Appendix: Data). A summary of the steps is included below:
- str() showed that a large number of variables had a high proportion of NAs; these were removed.
- The nearZeroVar() function removed 60 near-zero-variance variables.
- Inspecting the data set, there are a number of variables which are not related to classe, being identification or timestamp variables: X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, num_window. These were also removed.
- Finally, I set the variable classe as a factor using the function as.factor() for ease of future computations.
The analysis may have benefited from both feature engineering and imputing missing values. However, due to both time constraints and a lack of subject knowledge, I decided that the best course of action was to use the features as-is. We therefore conclude the data cleaning phase with a dataset containing 53 variables.
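The cleaning logic summarised above can be sketched in a few lines of base R. This is a minimal illustration on a toy data frame, not the report's actual code (which lives in the appendix and uses caret::nearZeroVar()); the 80% missingness cutoff and the toy column values here are assumptions for the example only.

```r
# Minimal base-R sketch of the cleaning steps on a hypothetical toy data frame.
df <- data.frame(
  X             = 1:6,                                # row identifier
  user_name     = rep("adelmo", 6),                   # id variable, no variance
  roll_belt     = c(1.4, 1.4, 1.5, 1.5, 1.4, 1.6),    # a real predictor
  max_roll_belt = c(NA, NA, NA, NA, NA, 2.1),         # mostly missing
  classe        = c("A", "A", "B", "B", "C", "C")     # outcome
)

# 1. Drop columns where more than 80% of values are missing (assumed cutoff)
mostly_na <- sapply(df, function(col) mean(is.na(col)) > 0.8)
df <- df[, !mostly_na]

# 2. Drop zero-variance columns (a crude stand-in for caret::nearZeroVar())
near_zero <- sapply(df, function(col) length(unique(col)) <= 1)
df <- df[, !near_zero]

# 3. Drop identification variables by name
df <- df[, setdiff(names(df), c("X", "user_name"))]

# 4. Set the outcome as a factor
df$classe <- as.factor(df$classe)
names(df)  # only the predictor and the factor outcome remain
```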
In order to assess the accuracy of the models created, we will need a validation data set. This is an important step: since no model will be trained on this data, it gives a good assessment of the likely model accuracy on data held out of the model build, and will be used to estimate the likely accuracy on the test set. I have chosen to split the train data provided, cleaned in the previous step, into 70% train data and 30% validation data. The validation data will be excluded from all model fitting. For this I will use the function createDataPartition() in the caret package.
inTrain <- createDataPartition(df_train$classe, p = 0.7, list = F)
df_train_final <- df_train[inTrain,]; dim(df_train_final)
## [1] 13737 53
df_valid_final <- df_train[-inTrain,]; dim(df_valid_final)
## [1] 5885 53
Finally, on the concept of validation: machine learning techniques often rely on k-fold cross-validation when fitting the model on the training data, which is especially useful here because tree-based methods are low bias but high variance. While training our model we break the training data into k folds; for each fold, the model is built on the remaining k-1 folds and then tested on the kth fold, and the average of the k recorded errors is the cross-validation error. The model selected is the one with the lowest average cross-validation error. This is made very simple using the caret package, so we'll set it up now and use the same cross-validation approach for all the models we fit.
cv_method <- trainControl(method = "cv", number = 5)
We could experiment with different cross-validation approaches, which may lead us to a better model, but due to time constraints I won't test out different approaches in this report.
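To make the k-fold procedure described above concrete, here is a hand-rolled sketch in base R on the built-in iris data. The deliberately simple nearest-centroid classifier is an assumption for illustration only; caret's trainControl() and train() do all of this bookkeeping for us on the real data.

```r
# Hand-rolled 5-fold cross-validation in base R (illustrative only).
set.seed(2808)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))  # assign each row to a fold

fold_errors <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]   # model is built on the remaining k-1 folds...
  test  <- iris[folds == i, ]   # ...and tested on the held-out kth fold

  # Toy classifier: predict the species whose mean Petal.Length
  # (computed on the training folds) is closest to the observation.
  centroids <- tapply(train$Petal.Length, train$Species, mean)
  preds <- names(centroids)[
    apply(abs(outer(test$Petal.Length, centroids, "-")), 1, which.min)]

  mean(preds != as.character(test$Species))  # misclassification on this fold
})

cv_error <- mean(fold_errors)  # average of the k recorded errors
cv_error
```

With caret, choosing among candidate models then amounts to picking the one with the lowest average cross-validation error, exactly as this loop computes it by hand.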
Principal Component Analysis is a statistical procedure that converts a set of observations of potentially correlated variables into a set of linearly uncorrelated variables called principal components.
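The "linearly uncorrelated" claim is easy to verify with base R's prcomp() on a small built-in dataset (mtcars here, purely as a stand-in for our sensor data):

```r
# The original variables are strongly correlated...
raw_cor <- cor(mtcars)
max(abs(raw_cor[upper.tri(raw_cor)]))   # large off-diagonal correlations

# ...but after PCA the component scores are pairwise uncorrelated.
pr <- prcomp(mtcars, center = TRUE, scale. = TRUE)
pc_cor <- cor(pr$x)
max(abs(pc_cor[upper.tri(pc_cor)]))     # effectively zero
```

This is what lets us keep only the leading components: each one captures variance not already explained by the others.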
First we check whether there are correlations in our data, using the cor() function to calculate them and corrplot() to plot them, to see if there is potential to use PCA on this data set.
correlations <- cor(df_train_final[,-53]) #REMOVE NON NUMERIC VARIABLES classe
corrplot(correlations, order = "FPC", method = "color", type = "lower", tl.cex = 0.8, tl.col = "grey50")
We can see from this plot that some variables are heavily correlated with others. This suggests that there may be some benefit in carrying out principal component analysis to reduce the number of variables. The caret package again makes this very easy for us with the function preProcess(), though we could also use prcomp() in the stats package. Here the variables are centered and scaled, and the preProcess() command keeps all the components that explain 95% of the variance.
pca <- preProcess(df_train_final[,-53], method = "pca")
#CARET KEEPS THE COMPONENTS THAT EXPLAIN 95% OF THE VARIANCE
df_train_final_pca <- predict(pca, df_train_final[,-53])
df_train_final_pca$classe <- df_train_final$classe
df_valid_final_pca <- predict(pca, df_valid_final[,-53])
df_valid_final_pca$classe <- df_valid_final$classe
df_test_pca <- predict(pca, df_test[,-53])
# OUTPUT PCA 1 & 2
pca$rotation[,1:2]
## PC1 PC2
## roll_belt -0.301726384 0.146053040
## pitch_belt -0.038366423 -0.290413621
## yaw_belt -0.188404289 0.262352869
## total_accel_belt -0.299480812 0.126121475
## gyros_belt_x 0.105712118 0.187047255
## gyros_belt_y -0.090397747 0.211728149
## gyros_belt_z 0.184683583 0.035613573
## accel_belt_x 0.023766707 0.293390066
## accel_belt_y -0.315362047 0.054923453
## accel_belt_z 0.311985968 -0.122195446
## magnet_belt_x -0.002060574 0.290849822
## magnet_belt_y 0.122823971 0.081144803
## magnet_belt_z 0.067472466 0.117707455
## roll_arm 0.054550696 -0.181479280
## pitch_arm 0.036594767 0.068701247
## yaw_arm 0.042974510 -0.119380535
## total_accel_arm 0.111198712 -0.042260527
## gyros_arm_x -0.005520767 0.051720332
## gyros_arm_y 0.068144563 -0.084268001
## gyros_arm_z -0.145975336 0.193374637
## accel_arm_x -0.161076646 -0.100805071
## accel_arm_y 0.260111743 -0.138654826
## accel_arm_z -0.131451710 0.004038981
## magnet_arm_x -0.084065781 -0.007403693
## magnet_arm_y 0.058993576 0.024725520
## magnet_arm_z 0.026395409 0.027272424
## roll_dumbbell 0.093504410 0.124797837
## pitch_dumbbell -0.117517655 -0.143170002
## yaw_dumbbell -0.138388193 -0.260302886
## total_accel_dumbbell 0.176154265 0.139562813
## gyros_dumbbell_x -0.007210618 -0.007767173
## gyros_dumbbell_y 0.002770346 0.038056125
## gyros_dumbbell_z 0.003928183 0.003811731
## accel_dumbbell_x -0.177452800 -0.130438788
## accel_dumbbell_y 0.190830731 0.171467951
## accel_dumbbell_z -0.166947903 -0.239673440
## magnet_dumbbell_x -0.179475473 -0.185579194
## magnet_dumbbell_y 0.156387429 0.162547521
## magnet_dumbbell_z 0.170194214 -0.030870590
## roll_forearm 0.061834998 -0.049863044
## pitch_forearm -0.146424488 -0.097065570
## yaw_forearm 0.110576540 -0.042803361
## total_accel_forearm 0.002690407 0.096696436
## gyros_forearm_x -0.062940869 0.196689755
## gyros_forearm_y 0.001773031 0.018333283
## gyros_forearm_z 0.007353039 0.021463753
## accel_forearm_x 0.185914146 -0.098848606
## accel_forearm_y 0.038381941 0.090458338
## accel_forearm_z -0.029141106 0.040862136
## magnet_forearm_x 0.103973218 -0.013893603
## magnet_forearm_y 0.025999126 0.048654460
## magnet_forearm_z -0.033233470 0.111861241
plot(df_train_final_pca[,1],df_train_final_pca[,2], col = df_train_final$classe)
# CALCULATE CUMULATIVE %
mod_prcomp <- prcomp(df_train_final[,-53], center = T, scale. = T)
#mod_prcomp$rotation[,1:2]
summary(mod_prcomp)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.8856 2.8503 2.17457 2.08178 1.90619 1.73392
## Proportion of Variance 0.1601 0.1562 0.09094 0.08334 0.06988 0.05782
## Cumulative Proportion 0.1601 0.3164 0.40730 0.49064 0.56052 0.61834
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.49347 1.43649 1.31834 1.22495 1.17547 1.06300
## Proportion of Variance 0.04289 0.03968 0.03342 0.02886 0.02657 0.02173
## Cumulative Proportion 0.66123 0.70091 0.73434 0.76319 0.78976 0.81149
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.99698 0.93321 0.90014 0.88453 0.82619 0.75172
## Proportion of Variance 0.01911 0.01675 0.01558 0.01505 0.01313 0.01087
## Cumulative Proportion 0.83061 0.84736 0.86294 0.87798 0.89111 0.90198
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.71704 0.69235 0.64343 0.62378 0.60573 0.57287
## Proportion of Variance 0.00989 0.00922 0.00796 0.00748 0.00706 0.00631
## Cumulative Proportion 0.91187 0.92108 0.92905 0.93653 0.94358 0.94990
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.5491 0.53760 0.50316 0.4838 0.44651 0.41519
## Proportion of Variance 0.0058 0.00556 0.00487 0.0045 0.00383 0.00331
## Cumulative Proportion 0.9557 0.96125 0.96612 0.9706 0.97445 0.97777
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.38083 0.3609 0.34735 0.33509 0.30446 0.28455
## Proportion of Variance 0.00279 0.0025 0.00232 0.00216 0.00178 0.00156
## Cumulative Proportion 0.98056 0.9831 0.98538 0.98754 0.98932 0.99088
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.25454 0.23814 0.23517 0.20303 0.1914 0.18288
## Proportion of Variance 0.00125 0.00109 0.00106 0.00079 0.0007 0.00064
## Cumulative Proportion 0.99213 0.99322 0.99428 0.99507 0.9958 0.99642
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.17984 0.16643 0.16442 0.16407 0.14733 0.14248
## Proportion of Variance 0.00062 0.00053 0.00052 0.00052 0.00042 0.00039
## Cumulative Proportion 0.99704 0.99758 0.99810 0.99861 0.99903 0.99942
## PC49 PC50 PC51 PC52
## Standard deviation 0.11267 0.09618 0.07717 0.04638
## Proportion of Variance 0.00024 0.00018 0.00011 0.00004
## Cumulative Proportion 0.99967 0.99984 0.99996 1.00000
# BUILD DATA FRAME FOR CUMULATIVE VARIANCE PLOT
vars <- apply(mod_prcomp$x, 2, var)
props <- cumsum(vars/sum(vars))
pca_plot <- as.data.frame(props)
pca_plot$PCA_Num <- seq.int(nrow(pca_plot))
names(pca_plot)[names(pca_plot) == "props"] <- "Cumulative_Variance"
# CREATE PLOT
g <- ggplot(data = pca_plot, aes(x=PCA_Num, y=Cumulative_Variance))
g + geom_line() +
geom_point() +
ggtitle("Principal Component Analysis",
subtitle = "Cumulative Variance Explained up to and including Principal Component") +
ylab("Variance Explained") +
xlab("Principal Component #") +
geom_hline(yintercept = 0.95, colour = "red", size = 1, linetype = "dashed") +
geom_text(aes(label = "95% Variance Threshold"),
size = 4, x = 10, y = 0.967, colour = "red" )
From this we can see that the first 25 principal components are enough to explain 95% of the variance, roughly halving the number of features.
After our data cleaning, splitting and pre-processing has been completed, we have 6 datasets to use in the remainder of the analysis:

- df_train_final: 70% of the original train data, with irrelevant, largely missing and near-zero-variance variables removed. Contains 53 variables and 13,737 rows. Will be used to train models.
- df_valid_final: 30% of the original train data, with the same variables removed. Contains 53 variables and 5,885 rows. Will be used to assess the expected accuracy of trained models before we apply them to the test dataset.
- df_train_final_pca: 70% of the original train data, with the same variables removed, principal components analysis applied to the remaining variables, and the classe variable appended. Contains 26 variables and 13,737 rows. Will be used to train models.
- df_valid_final_pca: 30% of the original train data, with the same transformations as df_train_final_pca. Contains 26 variables and 5,885 rows. Will be used to assess the expected accuracy of trained models before we apply them to the test dataset, and also to check whether PCA adds value to this exercise.
- df_test: the original test data with the same transformations as applied to df_train_final. 53 variables and 20 observations. Will be used to score final models and submit the quiz.
- df_test_pca: the original test data with the same transformations as applied to df_train_final_pca. 26 variables and 20 observations. Will be used to score final models and submit the quiz if PCA proves more successful than using the raw data features.

We will fit 3 types of models to these datasets, plus an ensemble model of the best two. The following methods will be used:
- rpart via the caret package: these models are very simple to understand and visualise, but with such low complexity we expect them to have the lowest accuracy.
- The rf method in caret: with a multitude of trees the model is more difficult to interpret, but we expect it to have higher accuracy on our validation data than a simple decision tree.
- The xgbTree method in caret: a gradient boosting implementation; trees are fitted sequentially, each correcting the errors of the last, and we expect accuracy at least comparable to the random forest.

Before we start modelling it is important to set a seed so that results are reproducible. While I could just initialise this once for the project, I prefer to reset the seed before every new model is fitted, so you will frequently see the short line of code set.seed(2808); the choice of 2808 as the seed is arbitrary.
Throughout this section I use Accuracy as the metric to assess predictive power of my models.
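As a quick reminder of what this metric means, accuracy is simply the proportion of predictions that match the actual class, which is what confusionMatrix() reports as its "Accuracy" statistic. A tiny sketch with made-up predictions:

```r
# Accuracy = proportion of predictions matching the actual class.
actual    <- factor(c("A", "A", "B", "C", "E", "D"))
predicted <- factor(c("A", "B", "B", "C", "E", "D"), levels = levels(actual))

accuracy <- mean(predicted == actual)
accuracy  # 5 of 6 correct
```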
For the decision tree models we will use the function train() in caret, which uses the rpart package; this function applies our chosen cross-validation method, discussed above, to reduce potential overfitting.
# Using variables as provided
set.seed(2808)
model_tree <- train(classe ~ ., data = df_train_final, method = "rpart", tuneLength = 50, metric = "Accuracy", trControl = cv_method)
# Using PCA
set.seed(2808)
model_tree_pca <- train(classe ~ ., data = df_train_final_pca, method = "rpart", tuneLength = 50, metric = "Accuracy", trControl = cv_method)
Now we use the predict() function to score these models on our validation data. We can then compare the actual classe outcomes in the validation dataset against the predicted classe values using the confusionMatrix() function, and use this to assess the accuracy of our models.
# Non PCA Tree
predict_tree <- predict(model_tree, newdata = df_valid_final)
CM_tree <- confusionMatrix(predict_tree, df_valid_final$classe)
CM_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1617 64 10 9 14
## B 15 962 42 16 37
## C 11 55 938 41 20
## D 13 33 27 879 28
## E 18 25 9 19 983
##
## Overall Statistics
##
## Accuracy : 0.914
## 95% CI : (0.9066, 0.9211)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8912
## Mcnemar's Test P-Value : 2.245e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9659 0.8446 0.9142 0.9118 0.9085
## Specificity 0.9770 0.9768 0.9739 0.9795 0.9852
## Pos Pred Value 0.9434 0.8974 0.8808 0.8969 0.9326
## Neg Pred Value 0.9863 0.9632 0.9817 0.9827 0.9795
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2748 0.1635 0.1594 0.1494 0.1670
## Detection Prevalence 0.2912 0.1822 0.1810 0.1665 0.1791
## Balanced Accuracy 0.9715 0.9107 0.9440 0.9457 0.9469
# PCA Tree
predict_tree_pca <- predict(model_tree_pca, newdata = df_valid_final_pca)
CM_tree_pca <- confusionMatrix(predict_tree_pca, df_valid_final_pca$classe)
CM_tree_pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1336 184 73 72 54
## B 112 669 113 72 156
## C 135 137 778 163 90
## D 64 72 14 601 57
## E 27 77 48 56 725
##
## Overall Statistics
##
## Accuracy : 0.6982
## 95% CI : (0.6863, 0.7099)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6179
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7981 0.5874 0.7583 0.6234 0.6701
## Specificity 0.9090 0.9046 0.8920 0.9579 0.9567
## Pos Pred Value 0.7772 0.5963 0.5971 0.7438 0.7771
## Neg Pred Value 0.9189 0.9013 0.9459 0.9285 0.9279
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2270 0.1137 0.1322 0.1021 0.1232
## Detection Prevalence 0.2921 0.1907 0.2214 0.1373 0.1585
## Balanced Accuracy 0.8536 0.7460 0.8251 0.7907 0.8134
We can see from the output above that the tree fitted on the variables as provided has an accuracy of 91.4%, while the model fitted on the data reduced using PCA has an accuracy of 69.82%.
This shows the expected drop-off in predictive power when we use PCA to reduce the number of features. Still, an accuracy of 91.4% is not a bad starting point for such a simplistic model.
Extending a tree to a forest, we expect a jump in predictive power.
# NON PCA
set.seed(2808)
model_rf <- train(classe ~., data = df_train_final, method = "rf", trControl = cv_method)
# PCA
set.seed(2808)
model_rf_pca <- train(classe ~., data = df_train_final_pca, method = "rf", trControl = cv_method)
Again predicting these on to the validation dataset and assessing accuracy.
#PREDICT MODELS ONTO VALIDATION DATA
predict_rf <- predict(model_rf, newdata = df_valid_final)
predict_rf_pca <- predict(model_rf_pca, newdata = df_valid_final_pca)
#NON PCA confusion Matrix
CM_rf <- confusionMatrix(predict_rf, df_valid_final$classe)
CM_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 2 0 0 0
## B 1 1136 9 0 0
## C 0 1 1016 12 1
## D 0 0 1 951 4
## E 0 0 0 1 1077
##
## Overall Statistics
##
## Accuracy : 0.9946
## 95% CI : (0.9923, 0.9963)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9931
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9974 0.9903 0.9865 0.9954
## Specificity 0.9995 0.9979 0.9971 0.9990 0.9998
## Pos Pred Value 0.9988 0.9913 0.9864 0.9948 0.9991
## Neg Pred Value 0.9998 0.9994 0.9979 0.9974 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1930 0.1726 0.1616 0.1830
## Detection Prevalence 0.2846 0.1947 0.1750 0.1624 0.1832
## Balanced Accuracy 0.9995 0.9976 0.9937 0.9927 0.9976
#PCA confusion Matrix
CM_rf_pca <- confusionMatrix(predict_rf_pca, df_valid_final_pca$classe)
CM_rf_pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1667 18 3 1 0
## B 3 1103 18 1 4
## C 1 15 997 29 8
## D 2 0 7 931 9
## E 1 3 1 2 1061
##
## Overall Statistics
##
## Accuracy : 0.9786
## 95% CI : (0.9746, 0.9821)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9729
## Mcnemar's Test P-Value : 4.101e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9958 0.9684 0.9717 0.9658 0.9806
## Specificity 0.9948 0.9945 0.9891 0.9963 0.9985
## Pos Pred Value 0.9870 0.9770 0.9495 0.9810 0.9934
## Neg Pred Value 0.9983 0.9924 0.9940 0.9933 0.9956
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2833 0.1874 0.1694 0.1582 0.1803
## Detection Prevalence 0.2870 0.1918 0.1784 0.1613 0.1815
## Balanced Accuracy 0.9953 0.9815 0.9804 0.9811 0.9896
The accuracy here is much better, as we'd expect: the model fitted on the variables as provided has an accuracy of 99.46%, while the model fitted on the data reduced using PCA has an accuracy of 97.86%.
We can see that the PCA-based model has closed the gap considerably on the model based on the variables as provided, but it still lags.
With the accuracy of the Random Forest model so high, we could simply have used it to predict the classes of our 20 test cases; however, for completeness and to better understand the concepts of the course, I'll continue as planned. Again, we fit the models.
# NON PCA MODEL
set.seed(2808)
model_gbm <- train(classe ~., data = df_train_final, method = "xgbTree", trControl = cv_method)
# PCA MODEL
set.seed(2808)
model_gbm_pca <- train(classe ~., data = df_train_final_pca, method = "xgbTree", trControl = cv_method)
Using these models to predict the outcome of the validation data and assessing accuracy of predictions vs actual values.
#PREDICT VALIDATION DATA OUTCOMES
predict_gbm <- predict(model_gbm, newdata = df_valid_final)
predict_gbm_pca <- predict(model_gbm_pca, newdata = df_valid_final_pca)
#ACCURACY METRICS - NON PCA
CM_gbm <- confusionMatrix(predict_gbm, df_valid_final$classe)
CM_gbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 2 0 0 0
## B 1 1137 2 3 0
## C 0 0 1020 11 1
## D 0 0 4 949 2
## E 0 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9933, 0.997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9982 0.9942 0.9844 0.9972
## Specificity 0.9995 0.9987 0.9975 0.9988 0.9998
## Pos Pred Value 0.9988 0.9948 0.9884 0.9937 0.9991
## Neg Pred Value 0.9998 0.9996 0.9988 0.9970 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1932 0.1733 0.1613 0.1833
## Detection Prevalence 0.2846 0.1942 0.1754 0.1623 0.1835
## Balanced Accuracy 0.9995 0.9985 0.9958 0.9916 0.9985
#ACCURACY METRICS - PCA
CM_gbm_pca <- confusionMatrix(predict_gbm_pca, df_valid_final_pca$classe)
CM_gbm_pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1631 45 6 6 6
## B 14 1041 35 6 23
## C 13 35 951 51 19
## D 12 5 28 893 19
## E 4 13 6 8 1015
##
## Overall Statistics
##
## Accuracy : 0.9398
## 95% CI : (0.9335, 0.9458)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9239
## Mcnemar's Test P-Value : 7.276e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9743 0.9140 0.9269 0.9263 0.9381
## Specificity 0.9850 0.9836 0.9757 0.9870 0.9935
## Pos Pred Value 0.9628 0.9303 0.8896 0.9331 0.9704
## Neg Pred Value 0.9897 0.9794 0.9844 0.9856 0.9862
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2771 0.1769 0.1616 0.1517 0.1725
## Detection Prevalence 0.2879 0.1901 0.1816 0.1626 0.1777
## Balanced Accuracy 0.9797 0.9488 0.9513 0.9567 0.9658
The accuracy here is slightly better than the Random Forest models: the model fitted on the variables as provided has an accuracy of 99.54%, while the model fitted on the data reduced using PCA has an accuracy of 93.98%.
While either the Random Forest or the GBM model using the actual variables, rather than the reduced variable set under PCA, would be sufficient for the test dataset given accuracies of over 99%, we will attempt model stacking. We don't really expect much of an improvement in accuracy; again, I'm just using this as an opportunity to apply some of the theory from the course.
The steps of fitting a stacked model are as follows:
#GET PREDICTIONS FROM GBM MODEL FOR TRAINING DATASET
predict_gbm_stack <- predict(model_gbm, newdata = df_train_final)
#GET PREDICTIONS FROM RANDOM FOREST MODEL FOR TRAINING DATASET
predict_rf_stack <- predict(model_rf, newdata = df_train_final)
#CREATE A DATAFRAME TO COLLECT MODEL PREDICTIONS AND CLASSE FACTOR TOGETHER
df_stacked <- data.frame(predict_gbm = predict_gbm_stack, predict_rf = predict_rf_stack, classe = df_train_final$classe)
#NOW FIT MODEL
set.seed(2808)
model_stacked <- train(classe ~ predict_gbm + predict_rf, data = df_stacked, trControl = cv_method, method = "xgbTree")
Now we predict onto the validation data and assess accuracy.
#PREDICT ONTO VALIDATION DATA
predict_stacked <- predict(model_stacked, newdata = df_valid_final)
#TEST ACCURACY OF STACKED MODEL
CM_stacked <- confusionMatrix(predict_stacked, df_valid_final$classe)
CM_stacked
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 2 0 0 0
## B 1 1137 2 3 0
## C 0 0 1020 11 1
## D 0 0 4 949 2
## E 0 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9933, 0.997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9982 0.9942 0.9844 0.9972
## Specificity 0.9995 0.9987 0.9975 0.9988 0.9998
## Pos Pred Value 0.9988 0.9948 0.9884 0.9937 0.9991
## Neg Pred Value 0.9998 0.9996 0.9988 0.9970 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1932 0.1733 0.1613 0.1833
## Detection Prevalence 0.2846 0.1942 0.1754 0.1623 0.1835
## Balanced Accuracy 0.9995 0.9985 0.9958 0.9916 0.9985
We can see that the stacked model has an accuracy of 99.54%, the same as the GBM model's 99.54% and slightly above the Random Forest model's 99.46%.
In Kaggle competitions the winning entry is typically a stacked model, so we might expect stacked models to outperform models from single methods. Why is this not the case here?
To investigate this last point quickly, we can compare the actual classe values in the validation data against the predictions of the GBM and random forest models: where the predictions differ, is it clear that one of these models is superior to the other?
First we create a dataset containing the predictions of the two models on the validation data plus the actual classe variable, then subset it to the rows where the GBM model's prediction differs from the Random Forest model's.
# LETS SEE WHERE PREDICTIONS DIFFER BETWEEN GBM AND RF MODEL...
df_valid_stack <- data.frame(predict_gbm = predict_gbm, predict_rf = predict_rf, classe = df_valid_final$classe)
df_stack_mismatch <- df_valid_stack[df_valid_stack$predict_gbm != df_valid_stack$predict_rf,]
There are only 39 instances out of the 5,885 rows in the validation dataset where this mismatch occurs, so less than 1%. Let's look at the first few rows of this data, then calculate the percentage of mismatches which are correctly predicted by the GBM model.
head(df_stack_mismatch)
## predict_gbm predict_rf classe
## 638 A B A
## 1174 B A A
## 1837 A B B
## 2398 B A B
## 2404 A B B
## 2590 B A B
sum(df_stack_mismatch$predict_gbm == df_stack_mismatch$classe) / dim(df_stack_mismatch)[1]
## [1] 0.5641026
This shows us that 56.4% of the mismatches are predicted correctly by the GBM model, consistent with the GBM model being marginally superior to the Random Forest model, as we saw earlier. It also means it is difficult for the stacked model to improve on the expected GBM accuracy.
As we have fitted a lot of models above, we pull all the results into one data frame and plot them to summarise the analysis.
Accuracy <- c(CM_tree$overall[1], CM_tree_pca$overall[1], CM_rf$overall[1], CM_rf_pca$overall[1], CM_gbm$overall[1], CM_gbm_pca$overall[1], CM_stacked$overall[1] )
PCA_Model <- c("N", "Y", "N", "Y","N", "Y", "N")
X <- c(1,1,2,2,3,3,4)
Model_Name <- c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "GBM", "GBM", "Ensemble")
df_results <- data.frame(Model_Name, PCA_Model, Accuracy, X)
library(ggrepel)
ggplot(data = df_results, aes(x=X, y = Accuracy, color = PCA_Model)) +
geom_point() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
ggtitle("Model Accuracy",
subtitle = "Expected Accuracy based on Validation Data") +
ylab("Accuracy") +
geom_label_repel(aes(label= paste0(Model_Name,";", round(Accuracy,4))),
arrow = arrow(length = unit(0.03, "npc"), type = "closed", ends = "first"),
force = 10,
box.padding = 0.35, point.padding = 0.5, label.size = 0.1, show.legend = F) +
ylim(0.6,1.05)
This shows that the GBM and stacked models have the highest expected accuracy, that each PCA-based model lags its full-feature counterpart, and that the simple decision tree is the weakest approach.
We will therefore select the model fitted using the GBM approach to predict the outcome of the test data; we expect the accuracy of the predictions to be 99.54%.
So now we've selected our model, let's find out a little more about it. Using the function varImp() in the caret package we can see which features have the most influence in the model. For a single decision tree, importance is calculated from the amount each attribute's split point improves the performance measure, weighted by the number of observations the node is responsible for. The feature importances are then averaged across all of the weak learners in the model. Let's plot:
importance <- varImp(model_gbm, scale = F)
plot(importance, main = "GBM Model: Feature Importance")
This shows us the following:

- The most important features in the model are the variables yaw_belt and roll_belt.
- A large number of features are used in the model, highlighting the issue with transparency of machine learning models.
Let's look at the first 2 trees fitted in the GBM model using the function xgb.plot.tree() in the xgboost package.
xgb.plot.tree(model = model_gbm$finalModel, trees =1:2 )
This shows the first 2 trees fitted in the GBM model, which gives some insight into the complexity of the model, though it only scratches the surface: 749 trees in total were fitted to the data.
The final part of the project involves predicting the outcome of the 20 test cases in the test dataset. As discussed in the summary section, we will use the GBM model fitted on the variables as provided (i.e. not using PCA).
predict_test <- data.frame(problem_id = df_test$problem_id, predicted_classe = predict(model_gbm, newdata = df_test))
predict_test
## problem_id predicted_classe
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B
First we'll import the data and run some simple summaries using the dim() and str() functions.
df_train <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"),header=TRUE)
df_test <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"),header=TRUE)
dim(df_train) #19622 Entries; 160 variables
## [1] 19622 160
dim(df_test) #20 Entries; 160 variables
## [1] 20 160
str(df_train)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
Here we can see that a large number of columns have NA values, so to quickly clean the data and reduce the number of variables we are working with, we will remove the variables whose proportion of NAs exceeds 95%. Note that in a fuller analysis with fewer time constraints we might investigate whether any of these variables, despite their limited data, could still improve the model. But first we will use the very handy nearZeroVar() function to remove variables with little or no variation in their response, which will not be useful in predicting the outcome.
# CLEAN PART 1: USE nearZeroVar TO REDUCE NUMBER OF VARIABLES
remove_nzv <- nearZeroVar(df_train)
length(remove_nzv) #WILL REMOVE 60 VARIABLES
## [1] 60
df_train <- df_train[,-remove_nzv] #REMOVE COLUMNS WITH CLOSE TO ZERO VARIANCE
df_test <- df_test[, -remove_nzv] #REPLICATE DATA PROCESSING IN TEST
# CLEAN PART 2: USE PROPORTION OF N/As or NULL TO FURTHER REDUCE NUMBER OF VARIABLES
NA_95pc <- sapply(df_train, function(x) mean(is.na(x))) > 0.95
summary(NA_95pc) #WILL REMOVE A FURTHER 41 VARIABLES
## Mode FALSE TRUE
## logical 59 41
df_train <- df_train[,NA_95pc == FALSE] #REMOVE COLUMNS WITH 95% NA FROM TRAIN
df_test <- df_test[,NA_95pc == FALSE] #REPLICATE DATA PROCESSING IN TEST
dim(df_train)
## [1] 19622 59
dim(df_test)
## [1] 20 59
# CLEAN PART 3: REMOVE IDENTIFICATION / TIMESTAMP COLUMNS
df_train <- df_train[, -c(1:6)]; df_train$classe <- as.factor(df_train$classe)
df_test <- df_test[,-c(1:6)]
str(df_train)
## 'data.frame': 19622 obs. of 53 variables:
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(df_test)
## [1] 20 53
Using these data cleansing approaches we are left with a much more manageable dataset of 19,622 rows and 53 variables. The same cleaning steps have been replicated in the test dataset.
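Since every cleaning step is applied to both data frames by column index, it is worth confirming that the two sets end up with matching predictors, differing only in the outcome (classe) versus the identifier (problem_id). The check below is a self-contained sketch with toy data frames standing in for df_train and df_test:

```r
# Sketch: report columns present in one frame but not the other
check_columns <- function(train, test) {
  list(only_in_train = setdiff(names(train), names(test)),
       only_in_test  = setdiff(names(test), names(train)))
}
# Toy illustration (hypothetical frames mirroring the cleaned data)
tr <- data.frame(roll_belt = 1.41, pitch_belt = 8.07, classe = "A")
te <- data.frame(roll_belt = 1.41, pitch_belt = 8.07, problem_id = 1)
check_columns(tr, te)
```

On the cleaned data, check_columns(df_train, df_test) should return "classe" and "problem_id" respectively; anything else would indicate the cleaning steps diverged between the two sets.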
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Ireland.1252 LC_CTYPE=English_Ireland.1252
## [3] LC_MONETARY=English_Ireland.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Ireland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DiagrammeR_1.0.0 xgboost_0.71.2 corrplot_0.84
## [4] randomForest_4.6-14 rpart.plot_3.0.6 rpart_4.1-13
## [7] rattle_5.2.0 caret_6.0-81 ggplot2_3.0.0.9000
## [10] lattice_0.20-35
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 bindr_0.1.1 knitr_1.20
## [4] magrittr_1.5 MASS_7.3-50 splines_3.5.1
## [7] hms_0.4.2 tidyselect_0.2.4 viridisLite_0.3.0
## [10] colorspace_1.3-2 R6_2.2.2 rlang_0.3.0.1
## [13] brew_1.0-6 foreach_1.4.4 visNetwork_2.0.5
## [16] Rook_1.1-1 gower_0.1.2 influenceR_0.1.0
## [19] withr_2.1.2 iterators_1.0.10 htmltools_0.3.6
## [22] class_7.3-14 recipes_0.1.4 assertthat_0.2.0
## [25] rprojroot_1.3-2 digest_0.6.17 tibble_1.4.2
## [28] Matrix_1.2-14 rmarkdown_1.10 downloader_0.4
## [31] compiler_3.5.1 pillar_1.3.0 scales_1.0.0
## [34] backports_1.1.2 generics_0.0.2 stats4_3.5.1
## [37] jsonlite_1.5 lubridate_1.7.4 pkgconfig_2.0.2
## [40] igraph_1.2.2 rstudioapi_0.7 munsell_0.5.0
## [43] prodlim_2018.04.18 plyr_1.8.4 dplyr_0.7.6
## [46] stringr_1.3.1 tools_3.5.1 grid_3.5.1
## [49] nnet_7.3-12 ipred_0.9-8 nlme_3.1-137
## [52] timeDate_3043.102 data.table_1.11.8 gtable_0.2.0
## [55] lazyeval_0.2.1 yaml_2.2.0 survival_2.42-3
## [58] crayon_1.3.4 bindrcpp_0.2.2 gridExtra_2.3
## [61] lava_1.6.4 tidyr_0.8.1 purrr_0.2.5
## [64] readr_1.2.1 reshape2_1.4.3 RColorBrewer_1.1-2
## [67] viridis_0.5.1 ModelMetrics_1.2.2 codetools_0.2-15
## [70] htmlwidgets_1.3 glue_1.3.0 evaluate_0.11
## [73] rgexf_0.15.3 stringi_1.1.7 XML_3.98-1.16